CS418 Project1 - Exploratory Data Analysis

Find the Project Description here.

This project was completed as part of CS418 - Introduction to Data Science at UIC.

Load Dataset

1. (5 pts.) Reshape dataset election_train from long format to wide format. Hint: the reshaped dataset should contain 1205 rows and 6 columns.
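One possible way to do this reshape in pandas is sketched below on a tiny made-up frame. The column names (`Year`, `State`, `County`, `Office`, `Party`, `Votes`) and the vote counts are assumptions for illustration, not taken from the project files.

```python
import pandas as pd

# Toy stand-in for election_train in long format: one row per county and party.
# Column names and vote counts are assumed for illustration only.
election_long = pd.DataFrame({
    "Year": [2018] * 4,
    "State": ["AZ", "AZ", "AZ", "AZ"],
    "County": ["Apache County", "Apache County",
               "Cochise County", "Cochise County"],
    "Office": ["US Senator"] * 4,
    "Party": ["Democratic", "Republican", "Democratic", "Republican"],
    "Votes": [16298.0, 7810.0, 17383.0, 26929.0],
})

# Move the party labels into columns; missing votes stay NaN instead of
# being silently summed to 0.
election_wide = (
    election_long
    .set_index(["Year", "State", "County", "Office", "Party"])["Votes"]
    .unstack("Party")
    .reset_index()
)
election_wide.columns.name = None  # drop the leftover "Party" axis label

print(election_wide.shape)
```

`set_index(...).unstack(...)` requires each (county, party) pair to appear only once; if duplicates exist, they would need to be aggregated first.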

2. Merge reshaped dataset election_train with dataset demographics_train. Make sure that you address all inconsistencies in the names of the states and the counties before merging. Hint: the merged dataset should contain 1200 rows.
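The sketch below shows one way such a merge could be prepared. It assumes, purely for illustration, that one dataset uses state abbreviations and names like "Apache County" while the other uses full state names and lowercase county names; the helper `normalize_county`, the `STATE_ABBREV` mapping, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical mapping; a full analysis would cover every state in the data.
STATE_ABBREV = {"Arizona": "AZ", "Texas": "TX"}

def normalize_county(name: str) -> str:
    """Lowercase, trim, and drop a trailing ' county' suffix."""
    name = name.strip().lower()
    suffix = " county"
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name

# Toy frames with made-up values.
election_wide = pd.DataFrame({
    "State": ["AZ", "TX"],
    "County": ["Apache County", "Travis County"],
    "Democratic": [16298.0, 300000.0],
    "Republican": [7810.0, 100000.0],
})
demographics = pd.DataFrame({
    "State": ["Arizona", "Texas", "Texas"],
    "County": ["apache", "travis", "bexar"],
    "FIPS": [4001, 48453, 48029],
})

# Bring both key columns to a common spelling before merging.
election_wide["County"] = election_wide["County"].map(normalize_county)
demographics["County"] = demographics["County"].map(normalize_county)
demographics["State"] = demographics["State"].map(STATE_ABBREV)

merged = election_wide.merge(
    demographics, on=["State", "County"], how="inner", validate="one_to_one"
)
print(len(merged))
```

Passing `validate="one_to_one"` makes pandas raise an error if a (state, county) key is duplicated on either side, which helps catch names that were normalized incorrectly.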

3. (5 pts.) Explore the merged dataset. How many variables does the dataset have? What is the type of these variables? Are there any irrelevant or redundant variables? If so, how will you deal with these variables?

Merged Dataset Summary:
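A summary of this kind could be produced roughly as follows. The frame is a made-up stand-in for the merged dataset, and treating single-valued columns as candidates for removal is an assumption about what counts as redundant.

```python
import pandas as pd

# Toy stand-in for the merged dataset (values are illustrative).
merged = pd.DataFrame({
    "Year": [2018, 2018, 2018],
    "Office": ["US Senator"] * 3,
    "State": ["AZ", "AZ", "TX"],
    "County": ["apache", "cochise", "travis"],
    "Democratic": [16298.0, 17383.0, 300000.0],
    "Total Population": [72346, 128177, 1121645],
})

n_rows, n_cols = merged.shape
print(f"{n_rows} rows, {n_cols} variables")
print(merged.dtypes)  # type of each variable

# Columns holding a single value carry no information for the analysis.
constant_cols = [c for c in merged.columns
                 if merged[c].nunique(dropna=False) == 1]
print(constant_cols)
```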

4. (10 pts.) Search the merged dataset for missing values. Are there any missing values? If so, how will you deal with these values?

Missing values information: There are 3 rows with missing values for 'Democratic' votes and 2 rows with missing values for 'Republican' votes. Additionally, 680 rows have a 'Citizen Voting-Age Population' of 0, which can be treated as a missing value.

Dealing with missing values: The 5 rows with a missing vote count for one party should be deleted. With one party's votes unknown, the winner of the county cannot be determined, and keeping these rows could distort the results. Because 'Citizen Voting-Age Population' is effectively missing in 680 rows, it is best handled by dropping that feature from the data frame.
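The handling described above can be sketched as follows, using a small made-up frame in place of the merged dataset. Only the three column names mentioned in the text are taken from the write-up; the `County` column and all values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged dataset.
merged = pd.DataFrame({
    "County": ["a", "b", "c", "d"],
    "Democratic": [100.0, np.nan, 300.0, 50.0],
    "Republican": [80.0, 120.0, np.nan, 70.0],
    "Citizen Voting-Age Population": [0, 0, 1500, 0],
})

# Count explicit NaNs per column, and zeros that act as missing values.
missing_per_column = merged.isna().sum()
zero_cvap = int((merged["Citizen Voting-Age Population"] == 0).sum())
print(missing_per_column)
print(zero_cvap)

# Drop rows lacking either party's votes, then drop the unusable feature.
cleaned = (
    merged
    .dropna(subset=["Democratic", "Republican"])
    .drop(columns=["Citizen Voting-Age Population"])
    .reset_index(drop=True)
)
print(cleaned.shape)
```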

5. (5 pts.) Create a new variable named “Party” that labels each county as Democratic or Republican. This new variable should be equal to 1 if there were more votes cast for the Democratic party than the Republican party in that county and it should be equal to 0 otherwise
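A minimal sketch of this labeling rule, on made-up vote counts, is shown below. Note that a tie counts as 0 because the rule requires strictly more Democratic votes.

```python
import pandas as pd

# Made-up vote counts for three counties; the last one is a tie.
counties = pd.DataFrame({
    "Democratic": [100.0, 50.0, 70.0],
    "Republican": [80.0, 70.0, 70.0],
})

# 1 when Democratic votes strictly exceed Republican votes, 0 otherwise.
counties["Party"] = (counties["Democratic"] > counties["Republican"]).astype(int)
print(counties["Party"].tolist())  # [1, 0, 0]
```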

6. (10 pts.) Compute the mean median household income for Democratic counties and Republican counties. Which one is higher? Perform a hypothesis test to determine whether this difference is statistically significant at the α = 0.05 significance level. What is the result of the test? What conclusion do you make from this result?

The mean 'Median Household Income' of Democratic counties is higher than that of Republican counties.

Null Hypothesis: The mean Median Household Income of Democratic counties is equal to that of Republican counties (μd = μr).
Alternative Hypothesis: The mean Median Household Income of Democratic counties is higher than that of Republican counties (μd > μr).

We use a t-test because the population standard deviation is unknown.
We use a right-tailed t-test because the alternative hypothesis is μd > μr.

The p-value of the test is 3.57 × 10^-8, which is far below the significance level of 0.05.
Hence, we reject the null hypothesis: there is sufficient evidence to conclude that the mean Median Household Income of Democratic counties is higher than that of Republican counties.
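A right-tailed two-sample t-test of this kind could be run with SciPy roughly as follows. The data are synthetic, and the use of Welch's unequal-variance form (`equal_var=False`) is an assumption; the write-up does not say which variant was used. The `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic incomes for illustration only (not the project data).
income_dem = rng.normal(loc=53000, scale=12000, size=300)
income_rep = rng.normal(loc=48000, scale=11000, size=800)

# H0: mu_d == mu_r, H1: mu_d > mu_r (right-tailed).
t_stat, p_value = stats.ttest_ind(
    income_dem, income_rep, equal_var=False, alternative="greater"
)

alpha = 0.05
reject_null = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

With the real data, `income_dem` and `income_rep` would be the 'Median Household Income' values of the counties with `Party == 1` and `Party == 0`, respectively.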

7. (10 pts.) Compute the mean population for Democratic counties and Republican counties. Which one is higher? Perform a hypothesis test to determine whether this difference is statistically significant at the α = 0.05 significance level. What is the result of the test? What conclusion do you make from this result?

The mean 'Total Population' of Democratic counties is higher than that of Republican counties.

Null Hypothesis: The mean population of Democratic counties is equal to that of Republican counties (μd = μr).
Alternative Hypothesis: The mean population of Democratic counties is higher than that of Republican counties (μd > μr).

We use a t-test because the population standard deviation is unknown.
We use a right-tailed t-test because the alternative hypothesis is μd > μr.

The p-value of the test is 1.024 × 10^-14, which is far below the significance level of 0.05.
Hence, we reject the null hypothesis: there is sufficient evidence to conclude that the mean population of Democratic counties is higher than that of Republican counties.
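For reference, the statistic behind a right-tailed two-sample t-test can be written as below. This is the unequal-variance (Welch) form, which is an assumption, since the write-up does not state whether a pooled-variance test was used instead. Here $\bar{x}_d$, $s_d^2$, and $n_d$ denote the sample mean, sample variance, and number of Democratic counties, and likewise with subscript $r$ for Republican counties.

```latex
t = \frac{\bar{x}_d - \bar{x}_r}
         {\sqrt{\dfrac{s_d^2}{n_d} + \dfrac{s_r^2}{n_r}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_d^2}{n_d} + \dfrac{s_r^2}{n_r}\right)^2}
     {\dfrac{\left(s_d^2/n_d\right)^2}{n_d - 1}
      + \dfrac{\left(s_r^2/n_r\right)^2}{n_r - 1}},
\qquad
p = P\left(T_\nu \ge t\right)
```

The null hypothesis μd = μr is rejected in favor of μd > μr when $p < \alpha = 0.05$.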

8. (20 pts.) Compare Democratic counties and Republican counties in terms of age, gender, race and ethnicity, and education by computing descriptive statistics and creating plots to visualize the results. What conclusions do you make for each variable from the descriptive statistics and the plots?
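Per-party descriptive statistics can be computed with a pandas `groupby`, as in the sketch below. The frame is made up, and the column names `Percent Female` and `Percent Age 65 and Older` are hypothetical placeholders for the demographic variables.

```python
import pandas as pd

# Made-up values; Party is 1 for Democratic counties and 0 for Republican ones.
counties = pd.DataFrame({
    "Party": [1, 1, 0, 0],
    "Percent Female": [51.0, 52.0, 49.0, 50.0],
    "Percent Age 65 and Older": [14.0, 12.0, 19.0, 21.0],
})

# One row per party, with mean, median, and standard deviation per variable.
summary = (
    counties
    .groupby("Party")[["Percent Female", "Percent Age 65 and Older"]]
    .agg(["mean", "median", "std"])
)
print(summary)
```

With matplotlib installed, a call such as `counties.boxplot(column="Percent Female", by="Party")` would give a side-by-side box plot of one variable for the two groups.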

Gender:

Age:

Race and Ethnicity:

Education:

Gender and Voting Patterns:

Removing Redundant Computed Values:

9. (5 pts.) Based on your results for tasks 6-8, which variables in the dataset do you think are more important to determine whether a county is labeled as Democratic or Republican? Justify your answer.

According to the results of tasks 6-8, the 'Total Population' of a county and its education level ('Bachelor degree or higher', 'less than bachelor degree', and 'less than high school degree') are the most important variables for determining whether a county is labeled as Democratic or Republican.

10. (10 pts.) Create a map of Democratic counties and Republican counties using the counties’ FIPS codes and Python’s Plotly library. Note that this dataset does not include all United States counties.
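A sketch of how such a map could be built with Plotly Express follows. The helper names `format_fips` and `build_party_map` are hypothetical, the two-row frame is made up, and `counties_geojson` is assumed to be a county-boundary GeoJSON, loaded separately, whose feature ids are 5-digit FIPS strings.

```python
import pandas as pd

def format_fips(fips: pd.Series) -> pd.Series:
    """Return FIPS codes as zero-padded 5-character strings (4001 -> '04001')."""
    return fips.astype(int).astype(str).str.zfill(5)

def build_party_map(df: pd.DataFrame, counties_geojson: dict):
    """Build a county choropleth colored by the Party label (1 = Democratic)."""
    # Imported here so that format_fips can be used without Plotly installed.
    import plotly.express as px

    plot_df = df.assign(
        FIPS=format_fips(df["FIPS"]),
        Label=df["Party"].map({1: "Democratic", 0: "Republican"}),
    )
    return px.choropleth(
        plot_df,
        geojson=counties_geojson,
        locations="FIPS",   # matched against each GeoJSON feature's id
        color="Label",
        color_discrete_map={"Democratic": "blue", "Republican": "red"},
        scope="usa",
    )

# Made-up example rows.
counties = pd.DataFrame({"FIPS": [4001, 48453], "Party": [1, 0]})
print(format_fips(counties["FIPS"]).tolist())  # ['04001', '48453']
```

Zero-padding matters because FIPS codes read as integers lose their leading zero, and the codes would then fail to match the GeoJSON ids. Given a loaded GeoJSON, `build_party_map(counties, counties_geojson).show()` would display the map; counties absent from the dataset are simply left uncolored.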